Lesson 2: Advanced Machine Learning

Welcome to the introduction to deep learning! Using Keras and TensorFlow, you'll learn how to: 1) create a fully-connected neural network architecture, 2) apply neural nets to regression and classification, 3) train neural nets with stochastic gradient descent, and 4) improve performance with dropout, batch normalization, and other techniques.

1. Deep learning and its application to ecology

1.1 Artificial neural networks and deep learning

Artificial neural networks are a class of machine learning algorithms that are loosely based on biological neural networks. Advances in the availability of high-performance computers have allowed neural networks to become larger and deeper. Deep learning is a subfield of machine learning primarily concerned with these deep neural networks.

Multilayer perceptrons (MLPs) traditionally comprise at least three layers: an input layer, a hidden layer, and an output layer.

MLPs often comprise many more layers, allowing them to solve more complex problems. Much of the recent focus has been concentrated in the subfield of deep learning, which is concerned with the study of neural networks comprising many layers. Training such deep networks required new algorithms.

1.2 Deep learning for ecological networks

An ecological network is typically described as a graph G = (V, E) with vertices V representing species, and edges E describing the interactions between the species. This framework allows edges to be either directed or undirected, which means that the interaction can flow from one species to another (in the case of a food web), or the interaction can be mutual (in the case of a host-parasite network). In some networks, the edge can also be weighted, to describe the strength of an interaction. This is often used in food webs, to show the rate of flow of energy between species. Binary networks can be considered a special case of weighted networks, where the edge weights are simply set to either 0 or 1. Additionally, the edges can be labelled to represent different types of interaction - these networks are known as multilayer networks. The vertices of an ecological network can also be labelled, for example in bipartite networks where the vertices are labelled so that they belong to one of two groups. Networks such as host-parasite or seed dispersal networks are generally bipartite, since a single species is rarely both a host and parasite.

For mathematical convenience, a network can be described by an adjacency matrix. A network with n species would be represented by a real-valued n × n adjacency matrix A, where \(A_{ij}\) is the weight of the interaction between the i-th species and the j-th species. This means that the adjacency matrix is symmetric in undirected networks, and asymmetric in directed networks.
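As a concrete sketch of this representation, consider a made-up three-species food web (the species and interactions below are invented for illustration); the directed adjacency matrix is asymmetric, while treating every interaction as mutual yields a symmetric, undirected matrix:

```python
import numpy as np

# A made-up 3-species food web (directed): grass -> rabbit -> fox
species = ["grass", "rabbit", "fox"]
A = np.zeros((3, 3))
A[0, 1] = 1.0  # energy flows from grass (i = 0) to rabbit (j = 1)
A[1, 2] = 1.0  # energy flows from rabbit to fox

# A directed network generally has an asymmetric adjacency matrix
directed_is_symmetric = np.array_equal(A, A.T)   # False here

# Treating every interaction as mutual gives an undirected (symmetric) matrix
A_undirected = np.maximum(A, A.T)
undirected_is_symmetric = np.array_equal(A_undirected, A_undirected.T)  # True
```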

Topological analysis

A key aim of ecosystem analysis is to determine how stable an ecosystem is, and how it might react to environmental change. Therefore, a large amount of research has been concerned with determining the stability of ecosystems using theoretical techniques. These ideas were introduced by Elton, who described a community matrix M of size n × n for which the (i, j)-th entry represents the impact that a species i has on another species j around an equilibrium point of some unobserved dynamical system. This allowed stability analysis techniques from the dynamical systems literature to be used, where a system is considered stable to small perturbations if the eigenvalues λ of M all have negative real parts. The original study was concerned with random community matrices, but the same ideas have since been refined and applied to realistic community matrices. This area of research has contributed to the ongoing debate about the relationship between biodiversity and ecosystem stability.
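A minimal sketch of this eigenvalue stability check, assuming a small randomly generated community matrix with negative diagonal entries representing self-regulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small hypothetical community matrix M: random off-diagonal interaction
# strengths, with a negative diagonal representing self-regulation
n = 5
M = rng.normal(0.0, 0.1, size=(n, n))
np.fill_diagonal(M, -1.0)

# The equilibrium is locally stable to small perturbations if every
# eigenvalue of M has a negative real part
eigenvalues = np.linalg.eigvals(M)
stable = bool(np.all(eigenvalues.real < 0))
```

With weak off-diagonal interactions relative to the self-regulation term, the eigenvalues cluster around −1 and the equilibrium is stable; strengthening the interactions can push eigenvalues across the imaginary axis.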

Species importance metrics

It is often of interest to ecologists which species are the most influential within their ecosystem. These species are known as keystone species. This has consequences in conservation, since the loss of a keystone species could lead to the collapse of an entire ecosystem. Traditionally, keystone species were identified from their natural history, but it is often difficult to experimentally verify this. This has led to the introduction of graph-theoretic species importance metrics which aim to discover keystone species by considering the topology of the network.
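As a toy illustration, one of the simplest graph-theoretic importance metrics is degree centrality: the number of interactions a species takes part in. The species and interactions below are invented for the example, and degree is only one of many such metrics:

```python
# Degree centrality as a simple importance metric on a made-up
# undirected binary network (1 = interaction present, 0 = absent)
species = ["sea otter", "urchin", "kelp", "crab"]
A = [
    [0, 1, 1, 1],  # the sea otter interacts with every other species
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

degree = {sp: sum(row) for sp, row in zip(species, A)}
most_connected = max(degree, key=degree.get)  # a candidate keystone species
```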

Network structure prediction

Machine learning algorithms are able to discover patterns in data, and use those patterns to make predictions about previously unseen data samples. This has led to widespread use of ML in many different domains to automate laborious tasks which are expensive in terms of both time and money. Constructing ecological networks is just such a laborious task, and it is therefore highly desirable to develop machine learning algorithms that automate the process.

Missing link prediction

Missing link prediction - as the name suggests - is the problem of predicting which links are missing from a network. There is a wealth of literature in network analysis concerning this problem and its applications to a range of complex networks. Recently, attention has turned to the problem of link prediction in the context of ecological networks. This has immediately useful practical applications in ecology. The randomness present in an ecosystem makes it unlikely that all actual interactions are observed when collecting data to construct ecological networks. Therefore, the use of link prediction algorithms could help ecologists to discover actual unobserved interactions when used in conjunction with traditional ecological network construction methods.

2. Deep learning with fully-connected neural networks

2.1 Linear models with single or multiple inputs

The most impressive advances in artificial intelligence have been in the field of deep learning. Deep learning is an approach to machine learning characterized by deep stacks of computations, which can disentangle the kinds of complex and hierarchical patterns found in the most challenging real-world datasets. Through their power and scalability, neural networks have become the defining model of deep learning.

Neural networks are composed of neurons, where each neuron individually performs only a simple computation. The fundamental component of a neural network is the individual neuron (see Box 1).

Box 1: Models with single and multiple inputs

Single Input

Though individual neurons will usually only function as part of a larger network, it’s often useful to start with a single neuron model as a baseline. Single neuron models are linear models.

For example, training a model with ‘sugars’ (grams of sugars per serving) as input and ‘calories’ (calories per serving) as output in the dataset of cereals, we might find the bias is \(b=90\) and the weight is \(w=2.5\). We could estimate the calorie content of a cereal with 5 grams of sugar per serving like this:
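The estimate can be computed directly from the linear unit's formula; the weight and bias values below are the hypothetical fitted values from the text:

```python
# Hypothetical fitted values from the cereals example in the text
w = 2.5   # weight: extra calories per gram of sugar
b = 90.0  # bias: baseline calories

sugars = 5.0  # grams of sugar per serving
estimated_calories = w * sugars + b  # 2.5 * 5 + 90 = 102.5
```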


Multiple Inputs

The 80 Cereals dataset has many more features than just ‘sugars’. What if we wanted to expand our model to include things like fiber or protein content? That’s easy enough. We can just add more input connections to the neuron, one for each additional feature. To find the output, we would multiply each input to its connection weight and then add them all together.


The formula for this neuron would be \(y=w_{0}x_{0}+w_{1}x_{1}+w_{2}x_{2}+b\). A linear unit with two inputs will fit a plane, and a unit with more inputs than that will fit a hyperplane.
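A quick sketch of this sum-of-weighted-inputs computation; the weight values for 'fiber' and 'protein' are invented for illustration:

```python
# y = w0*x0 + w1*x1 + w2*x2 + b with three features: sugars, fiber, protein.
# The weight and bias values are made up for the example.
w = [2.5, -1.0, 1.5]
b = 90.0

x = [5.0, 2.0, 3.0]  # sugars, fiber, protein for one cereal

# multiply each input by its connection weight, then add them all together
y = sum(wi * xi for wi, xi in zip(w, x)) + b  # 12.5 - 2.0 + 4.5 + 90 = 105.0
```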


The diagram here depicts a single neuron, or unit. Does the formula \(y=wx+b\) look familiar? It's the slope-intercept equation, where \(w\) is the slope and \(b\) is the y-intercept.

The Linear Unit: y=wx+b

The input is \(x\). Its connection to the neuron has a weight which is \(w\). Whenever a value flows through a connection, you multiply the value by the connection’s weight. For the input \(x\), what reaches the neuron is \(w * x\). A neural network “learns” by modifying its weights.

The b is a special kind of weight we call the bias. The bias doesn’t have any input data associated with it; instead, we put a 1 in the diagram so that the value that reaches the neuron is just \(b\) (since \(1 * b = b\)). The bias enables the neuron to modify the output independently of its inputs.

The \(y\) is the value the neuron ultimately outputs. To get the output, the neuron sums up all the values it receives through its connections. This neuron's activation is \(y = wx + b\).

# Create a network with 1 linear unit
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])

The first argument, units, defines how many outputs we want. In this case we are just predicting 'calories', so units=1. The second argument, input_shape, tells Keras the dimensions of the inputs. Setting input_shape=[3] ensures the model will accept three features as input ('sugars', 'fiber', and 'protein'). This model is now ready to be fit to training data!

2.2 Stacking many linear models for deep neural networks

We can build neural networks capable of learning the complex kinds of relationships that deep neural nets are known for. The key idea is that we can combine and modify single units to model more complex relationships. Once we add some nonlinearity with activation functions, we can stack layers to produce complex data transformations (see Box 2).

Box 2: Layers and activation functions

Layers

Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs, we get a dense layer. A layer can be, essentially, any kind of data transformation. Many layers, like the convolutional and recurrent layers, transform data through the use of neurons and differ primarily in the pattern of connections they form. Others are used for feature engineering or just simple arithmetic.

You could think of each layer in a neural network as performing some kind of relatively simple transformation. Through a deep stack of layers, a neural network can transform its inputs in more and more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a solution.

The Activation Function

Dense layers by themselves can never move us out of the world of lines and planes. Without activation functions, neural networks can only learn linear relationships. What we need are activation functions to fit curves. The most common activation function is the rectifier function max(0,x).

The rectifier function has a graph that’s a line with the negative part “rectified” to zero. Applying the function to the outputs of a neuron will put a bend in the data, moving us away from simple lines.

When we attach the rectifier to a linear unit, we get a rectified linear unit or ReLU. Applying a ReLU activation to a linear unit means the output becomes \(max(0, w * x + b)\), which we might draw in a diagram like:
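Numerically, the ReLU clamps any negative pre-activation to zero while letting positive values pass through; the weight and bias below are chosen arbitrarily:

```python
def relu(z):
    # the rectifier function max(0, z)
    return max(0.0, z)

# An arbitrary linear unit feeding into the rectifier
w, b = 2.0, -1.0

output_neg = relu(w * 0.0 + b)  # pre-activation -1.0 is rectified to 0.0
output_pos = relu(w * 2.0 + b)  # pre-activation 3.0 passes through unchanged
```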



We build deep neural networks by stacking layers inside a Sequential model. By adding an activation function after each hidden layer, we give the network the ability to learn more complex (non-linear) relationships in the data.

Now, notice that the final (output) layer is a linear unit (meaning, no activation function). That makes this network appropriate to a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.

The Sequential model we've been using connects together a list of layers in order from first to last: the first layer gets the input, the last layer produces the output. The following code creates this model:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer
    layers.Dense(units=1),
])

For the dataset of cereals, we could scale this up to a three-layer network with over 1500 neurons; a network of that size should be capable of learning fairly complex relationships in the data.

3. Training neural network models

As with other machine learning tasks, training a network model means adjusting its weights so that it can transform the features (inputs) into the target (output). In the dataset of cereals, for instance, we want a network that can take each cereal's 'sugar', 'fiber', and 'protein' content and produce a prediction for that cereal's 'calories'. If we successfully train a network to do that, its weights must represent, in some way, the relationship between those features and that target.

Box 3: The loss function and the optimizer

The Loss Function

The loss function measures the disparity between the target's true value and the value the model predicts. Different problems call for different loss functions. A common loss function for regression is the mean absolute error (MAE), i.e., the average of abs(y_true - y_pred).

Besides MAE, other loss functions you might see for regression problems are the mean-squared error (MSE) or the Huber loss (both available in Keras).
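Both losses are simple averages over the prediction errors; for a toy set of calorie predictions:

```python
# A toy set of calorie predictions against the true values
y_true = [100.0, 120.0, 110.0]
y_pred = [90.0, 125.0, 110.0]

errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
```

MSE penalizes large errors much more heavily than MAE, which is one reason for choosing between them.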

The Optimizer: Stochastic Gradient Descent

Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps like this:

  • Sample some training data and run it through the network to make predictions.
  • Measure the loss between the predictions and the true values.
  • Finally, adjust the weights in a direction that makes the loss smaller.

Each iteration’s sample of training data is called a minibatch (often “batch”), while a complete round of the training data is called an epoch. The number of epochs you train is how many times the network will see each training example.
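The three steps above can be sketched as a minimal SGD loop for the single-input linear model; the data are synthetic, generated from the hypothetical cereal relationship y = 2.5x + 90 (with x rescaled to keep the gradients well behaved):

```python
import random

random.seed(0)

# Synthetic training set generated from the hypothetical relationship
# y = 2.5 * x + 90 (x plays the role of a rescaled 'sugars' feature)
data = [(i / 10, 2.5 * (i / 10) + 90.0) for i in range(10)]

w, b = 0.0, 0.0       # initial weights
learning_rate = 0.2
batch_size = 5

for epoch in range(300):          # one epoch = one full pass over the data
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]   # the minibatch
        # 1) run the minibatch through the model to make predictions,
        # 2) measure the loss (MSE here) via its gradient,
        # 3) shift w and b in the direction that makes the loss smaller
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

# After training, w and b should be close to the true values 2.5 and 90
```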

Picture an animation of the linear model being trained with SGD: pale red dots depict the entire training set, while solid red dots are the minibatches. Every time SGD sees a new minibatch, it shifts \(w\) and \(b\) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit.

Learning Rate and Batch Size

The learning rate and the size of the minibatches are the two parameters that have the largest effect on how the SGD training proceeds. Fortunately, for most work it won’t be necessary to do an extensive hyperparameter search to get satisfactory results. Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning. Adam is a great general-purpose optimizer.



After defining a loss function and an optimizer, we pass both to the model's compile method.

Now we’re ready to start the training! We’ve told Keras to feed the optimizer 256 rows of the training data at a time (the batch_size) and to do that 10 times all the way through the dataset (the epochs).
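A sketch of that compile-and-fit step; here `model` is assumed to be the Sequential network defined earlier, and `X_train` / `y_train` are assumed to hold the cereal features and calorie targets (neither is defined in this fragment):

```python
model.compile(
    optimizer="adam",  # Adam: adaptive learning rate, a good default
    loss="mae",        # mean absolute error for this regression task
)

history = model.fit(
    X_train, y_train,
    batch_size=256,  # 256 rows of training data per minibatch
    epochs=10,       # ten complete passes through the dataset
)
```

The returned history object records the loss epoch by epoch, which we use below to plot learning curves.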

4. Improving the performance of network models

4.1 Increasing capacity and early stopping

Keras will keep a history of the training and validation loss over the epochs while it trains the model. We'll examine these learning curves for evidence of underfitting and overfitting, and look at ways to correct them.

Learning Curves

We train a model by choosing weights or parameters that minimize the loss on a training set. We need to evaluate it on a new set of data, the validation data, to accurately assess a model’s performance.

When training a model, we have been plotting the loss on the training set epoch by epoch. To this we'll add a plot of the validation loss too. These plots we call the learning curves.

When a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned. Ideally, we would create models that learn all of the signal and none of the noise. We can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease.

Capacity

A model’s capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it wider (more units to existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.
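A sketch of the two options in Keras (the layer sizes and input shape here are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Wider: more units in the existing hidden layer
wider = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=[2]),
    layers.Dense(1),
])

# Deeper: an additional hidden layer
deeper = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=[2]),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
```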

Early Stopping

When a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn’t decreasing anymore. Interrupting the training this way is called early stopping.

Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won't continue to learn noise and overfit the data.

In Keras, we include early stopping in our training through a callback. A callback is just a function you want run every so often while the network trains. The early stopping callback will run after every epoch.
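A minimal sketch of such a callback; the min_delta and patience values here are reasonable-looking choices for illustration, not prescriptions:

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001,            # count an improvement only if >= 0.001
    patience=20,                # stop after 20 epochs without improvement
    restore_best_weights=True,  # reset to the weights at the loss minimum
)

# passed to training as: model.fit(..., callbacks=[early_stopping])
```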

4.2 Dropout and Batch Normalization

There are some special layers that do not contain any neurons themselves, but that are added to prevent overfitting and stabilize training.

Dropout

Overfitting is often caused by the network learning spurious patterns in the training data, which tend to rely on very specific combinations of weights: "conspiracies" of weights. To break up these conspiracies, we randomly drop out some fraction of a layer's input units at every step of training, making it much harder for the network to learn those spurious patterns. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.

You could also think about dropout as creating a kind of ensemble of networks. The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual. (If you’re familiar with random forests as an ensemble of decision trees, it’s the same idea.)

In Keras, the Dropout layer's rate argument defines what fraction of the input units to shut off. Put the Dropout layer just before the layer you want the dropout applied to:
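For example (the rate, layer sizes, and input shape are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=[2]),
    # apply 30% dropout to the inputs of the next layer
    layers.Dropout(rate=0.3),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
```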

Batch Normalization

Another special layer is “batch normalization” (or “batchnorm”), which can help correct training that is slow or unstable.

With neural networks, it’s generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn’s StandardScaler or MinMaxScaler. Now, if it’s good to normalize the data before it goes into the network, maybe normalizing inside the network would be better! This special kind of layer can do this. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then putting the data on a new scale with two trainable rescaling parameters.

It seems that batch normalization can be used at almost any point in a network. You can put it after a layer…

… or between a layer and its activation function:

And if you add it as the first layer of your network it can act as a kind of adaptive preprocessor, standing in for something like scikit-learn's StandardScaler.
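A sketch showing all three placements in one Sequential model (layer sizes and input shape are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # as the first layer: a kind of adaptive preprocessor for the inputs
    layers.BatchNormalization(input_shape=[2]),
    # after a layer:
    layers.Dense(16, activation='relu'),
    layers.BatchNormalization(),
    # or between a layer and its activation function:
    layers.Dense(16),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(1),
])
```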

5. Setup R environment for deep learning

5.1 R packages for neural networks

There are several R packages that can fit basic neural networks. The nnet package fits feed-forward neural networks with one hidden layer. The neuralnet package fits neural networks with multiple hidden layers and can train them using back-propagation. We will also use the RSNNS package, an R wrapper for the Stuttgart Neural Network Simulator (SNNS). The RSNNS package makes many model components from SNNS available, making it possible to train a wide variety of models.

The deepnet package provides a number of tools for deep learning in R. Specifically, it can train RBMs and use these as part of DBNs to generate initial values to train deep neural networks. The deepnet package also allows for different activation functions, and the use of dropout for regularization.

5.2 Deep learning frameworks for R

There are a number of R packages available for neural networks, but few options for deep learning. h2o (https://www.h2o.ai/) is an excellent, general machine learning framework written in Java, and it has an API that allows you to use it from R. However, most deep learning practitioners prefer dedicated deep learning libraries, such as TensorFlow, CNTK, and MXNet. Two deep learning libraries are well supported in R: MXNet and Keras. Keras is actually a frontend abstraction for other deep learning libraries, and can use TensorFlow in the background. We will use MXNet, Keras, and TensorFlow in this class.

MXNet

MXNet is a deep learning library developed by Amazon that runs on both CPUs and GPUs. Apache MXNet is a flexible and scalable deep learning framework that supports convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It can be distributed across multiple processors/machines and achieves almost linear scaling on multiple GPUs/CPUs. It is easy to install in R and supports a good range of deep learning functionality for R.

Keras

Keras is a high-level, open source, deep learning framework created by Francois Chollet from Google that emphasizes iterative and fast development; it is generally regarded as one of the best options to use to learn deep learning. Keras has a choice of backend lower-level frameworks: TensorFlow, Theano, or CNTK, but it is most commonly used with TensorFlow. Keras models can be deployed on practically any environment, for example, a web server, iOS, Android, a browser, or the Raspberry Pi.

To learn more about using Keras in R, go to https://keras.rstudio.com; this link also has more examples of R and Keras, as well as a handy Keras cheat sheet that gives a thorough reference to all of the functionality of the R Keras package. To install the keras package for R, run the following code:

devtools::install_github("rstudio/keras")
library(keras)
install_keras()

Pay close attention to how the R package links to the underlying Python libraries, and keep this link in mind when working with them.

On Ubuntu, you should first check which Python installation is available, and then create a virtual environment (for example, with python3.8-venv) for running Keras in RStudio.

reticulate::py_config()